Abstract: In conventional supervised pattern recognition tasks,
model selection is typically accomplished by minimizing the 
classification error rate on a set of so-called development data, 
subject to ground-truth labeling by human experts or some 
other means. In the context of speech processing systems and 
other large-scale practical applications, however, such labeled 
development data are typically costly and difficult to obtain. This 
article proposes an alternative semi-supervised framework for 
likelihood-based model selection that leverages unlabeled data by 
using trained classifiers representing each model to automatically 
generate putative labels. The errors that result from this auto- 
matic labeling are shown to be amenable to results from robust 
statistics, which in turn provide for minimax-optimal censored 
likelihood ratio tests that recover the nonparametric sign test as 
a limiting case. This approach is then validated experimentally 
using a state-of-the-art automatic speech recognition system to 
select between candidate word pronunciations using unlabeled 
speech data that only potentially contain instances of the words 
under test. Results provide supporting evidence for the utility 
of this approach, and suggest that it may also find use in other 
applications of machine learning. 

Index Terms — Likelihood ratio tests, pronunciation modeling, 
robust statistics, semi-supervised learning, sign test, speech recog- 
nition, spoken term detection. 

I. Introduction 

This article develops a simple and powerful likelihood-ratio
framework that enables the use of unlabeled development
data for model selection and system optimization in
the context of large-scale speech processing. Within the speech 
engineering community, acoustic likelihoods have long played 
a prominent role both as a training criterion and an objective 
function to aid in system development. Log-likelihood ratios 
have in turn featured ever more prominently in areas such 
as speech, speaker, and language recognition; for instance, it 
is now common practice that "target" model likelihoods are 
compared to those of a universal "background" model as part 
of many large-scale speech processing systems [1].

A. Model Selection Using Likelihood Ratios 

Comparing data likelihoods between competing models can 
serve as an effective means of model selection for clas- 
sification and regression tasks. However, when considering 
conditional likelihoods of the observed data given labels such
as orthographic transcriptions of speech waveforms, previous 
work has assumed that orthographic labels have been correctly 
assigned by human experts, and hence are known exactly. 
However, such "labeled data" do not come for free; their 
acquisition requires the time and expertise of a trained linguist, 
hence limiting scalability to the large sample sizes necessary 
to succeed in practical speech engineering tasks. 

This article thus posits a framework in which likelihoods 
evaluated using labels that are automatically assigned by two 
competing systems can serve as proxies for likelihoods based 
on ground-truth labeling. This yields not only a method- 
ologically sound algorithmic framework through which to 
incorporate unlabeled data into the likelihood-based model 
selection process, but also practical engineering strategies for 
selecting between competing models in order to optimize 
large-scale systems. Experiments to select between candidate 
word pronunciations in the context of state-of-the-art speech 
processing systems, using well-known corpora and standard 
metrics, serve to demonstrate the benefit of unlabeled devel- 
opment data in the context of large-scale speech processing. 

To construct this framework, insights from robust statistics 
are used to formulate the resultant semi-supervised model 
selection problem in a manner that permits principled analysis, 
and from which efficient and effective algorithms can be 
derived. By considering the automatic labeling procedure as a 
mixture of correct and incorrect assignments, the influence of 
incorrect labeling can be limited through what is known as a 
censored likelihood ratio evaluation. 

The well-known nonparametric sign test arises as a natural 
limiting procedure in this setting, and the technical develop- 
ment of this article shows how optimality properties derived 
by Huber [2] can be applied in the semi-supervised setting
to ensure that the maximal model selection error induced by
automatic labeling is minimized. Thus one arrives at an
algorithmic procedure that compares the relative performance 
of two competing systems in order to test the significance of 
performance differences between them, and hence to select 
the model that is "closest" (in the sense of Kullback-Leibler 
divergence) to the true data-generating distribution. 

B. Unlabeled Data in the Context of Speech Processing 

To clarify the notions of supervised/semi-supervised learn- 
ing and labeled/unlabeled data in the speech processing 
context at hand, we briefly recall the standard machine 
learning paradigm as follows. Fundamentally, one assumes 
the existence of an unknown joint probability distribution
$p_{X,Y}(x, y) \neq p_X(x)\,p_Y(y)$, from which a number of independent
and identically distributed samples $(x_1, y_1), (x_2, y_2), \ldots$
are available; these are termed training data, and are used to 
fit a model that predicts values taken by Y based on observed 
instances of X. In classification tasks Y is a discrete random 
variable, and its range of possible values comprises the set 
of labels — corresponding to, for example, an orthographic 
transcript of the word or phrase represented by an instance 
of acoustic waveform data X. 

The goal in a traditional supervised learning scenario is 
to devise algorithms that strike a balance between fidelity to 
the set of labeled examples $\{(x_i, y_i)\}$, and effective generalization
to other as-of-yet unseen test data comprising additional
observations of $X$: a classical bias-variance trade-off
between model goodness-of-fit and generalization properties. 
This trade-off is typically optimized by calculating empirical 
error rates on an additional "held-out" set of labeled data for 
which ground truth is known, in a manner similar to parameter 
estimation via cross-validation. 

Fitting a model to accomplish this goal is thus mathemat- 
ically equivalent to building a system, and one speaks of 
the "training" or model-building stage, and the "testing" or 
application stage, in which a system is subsequently deployed 
and put into practical use — and which assumes that both 
training and test data are drawn from the same probability 
distribution. When this assumption is satisfied, it is clear that 
speech engineering systems benefit directly from ever-greater 
amounts of labeled training data. Time, money, and expertise, 
however, typically limit the amount of such data available in 
any given application scenario of interest. It is thus of much 
interest to develop algorithms that are built using some amount 
of labeled training data, but whose performance can be further 
improved through careful use of unlabeled data — the so-called 
semi-supervised learning paradigm [3].

Thus far, the application of semi-supervised methods to 
speech processing has been limited to ideas such as data augmentation
[4] or self-training [5], each of which involves re-fitting
the models under consideration, and hence rebuilding
the corresponding speech engineering systems. While such 
approaches have shown promise, such extreme re-fitting may 
not be desirable — or even possible — in certain settings, for 
instance when a large-scale system is already deployed and 
must be adapted to new test conditions. 

Speech engineering is thus ripe for the introduction of 
new semi-supervised learning approaches; not only can nearly 
limitless amounts of acoustic waveform data be acquired from 
a variety of digital sources, but also many algorithms have 
matured to the point that performance improvements are often 
driven simply by increasing the amount of labeled training 
data. Employing unlabeled data to directly improve existing 
approaches, however, requires inferring the labels — and in 
this context, a natural but unsolved problem is to understand 
whether and how automatically labeled data taken as output 
from current systems can be used to this effect. As indicated 
above, this article brings ideas from robust statistics and 
likelihood-based model selection to bear on this problem, and 
introduces not only a framework to analyze the errors resulting
from automatic labeling, but also a practical means of treating
them.

The article is organized as follows. Section [II] develops 
likelihood-based semi-supervised model selection techniques, 
first considering the case of labeled data, and subsequently 
the unlabeled case. Section [III] then formulates this semi-supervised
framework in the speech processing context of
selecting from amongst competing pronunciation models to
optimize system performance. Large-scale experiments with
well-known data sets in Section [IV] then demonstrate that
this approach achieves state-of-the-art performance in the 
context of speech recognition, spoken term detection, and 
phonemic similarity to a given reference, even when compared 
to the conventional supervised method of forced alignments 
to reference orthographic transcripts. Section [V] concludes the 
article with a discussion of these results and their implication 
for improving speech processing through the use of unlabeled 
development data. 

II. Theory: Likelihood-Based Model Selection 

Viewed from a machine learning perspective, parametric 
statistical models are directly instantiated as large-scale speech 
processing systems. Labeled data are used to fit model pa- 
rameters in the manner described above; e.g., to estimate the 
state transition matrix of a hidden Markov model. In addition, 
one must also typically fit a modest number of parameters 
that alter the structure or function of the model class under 
consideration; for instance, in automatic speech recognition, 
the marginal acoustic likelihood of an utterance typically 
depends on a model for the pronunciation(s) of a given word — 
a setting we return to in Section [III].

When training and test conditions match exactly, all pa- 
rameters can be fitted simultaneously during the training 
stage, using principled and efficient procedures such as the 
expectation-maximization algorithm. In practice, however, it 
may be the case that only a small amount of labeled training 
data is well matched to the conditions that prevail during 
test — precluding even cross-validation as an option — or that 
a deployed system must be adapted to new test conditions in 
the absence of its original training data. In such cases it is 
typical to set aside a small amount of development data for 
purposes of model selection as follows. 

A. The Supervised Case: Labeled Development Data 

Recall that in our setting, X represents acoustic waveform 
data, and hence is a continuous random variable. The true but 
unknown data-generating model, then, takes the form of a conditional
probability density function $p(x \mid y) := p_{X \mid Y}(x \mid Y = y)$.
When interpreted for fixed $X$ as a function of unknown
label $Y$, this density thus evaluates to the acoustic likelihood
of $X$ for any given candidate label $Y = y$.

In practice, we have access to $p(x \mid y)$ only through the given
pairs of training samples $(x_1, y_1), (x_2, y_2), \ldots$, and we must
proceed in the absence of direct knowledge of the true model.
Any speech processing system will in turn generate its own
set of putative acoustic likelihoods, and thus it is natural to
seek the likelihood function that is closest to the true data-generating
model $p(x \mid y)$, in hopes that this will yield the best
overall system performance. This leads to a model selection
problem in which we use the training samples at hand as
a proxy for $p(x \mid y)$, to choose amongst competing models
and build a system that can predict $Y$ given $X$ with minimal
misclassification error.

Assume, then, that we have several competing sets of candidate
models $p_1(x \mid y; \theta_1), p_2(x \mid y; \theta_2), \ldots$, each dependent
on distinct parameter sets $\theta_1, \theta_2, \ldots$, whose quality we wish
to evaluate with respect to the true (but unknown) model
$p(x \mid y)$. A natural approach is to evaluate the Kullback-Leibler
divergence of the "best" representative $p_k(x \mid y; \theta_k^*)$ of each
set from $p(x \mid y)$, with $\theta_k^*$ the maximum-likelihood estimate of
parameter set $\theta_k$ as determined from the training data. Thus
we seek

$$\arg\min_k \; E_p\big(\log p(x \mid y)\big) - E_p\big(\log p_k(x \mid y; \theta_k^*)\big)
  = \arg\max_k \; E_p\big(\log p_k(x \mid y; \theta_k^*)\big),$$

with $-E_p\big(\log p_k(x \mid y; \theta_k^*)\big)$ sometimes referred to as the
cross-entropy of $p_k$ relative to $p$, and the corresponding
optimization task one of cross-entropy minimization.

Under the assumption of independent and identically distributed
pairs of training examples, we may form an empirical
estimate of each cross-entropy simply by evaluating the respective
data log-likelihoods $\log p_k(x \mid y; \theta_k^*)$ with respect to
each pair of training samples, and forming the corresponding
arithmetic averages. Assuming the necessary technical
conditions of [6], it then follows that we may formulate a
multi-way hypothesis test amongst models $p_1, p_2, \ldots$. We later
consider this multi-way setting in detail; however, for clarity
of exposition, we first consider the case of only two competing
models $p_1$ and $p_2$, which admits three possible outcomes:

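$$\begin{aligned}
H_0 &: E_p\big(\log p_1(x \mid y; \theta_1^*)\big) = E_p\big(\log p_2(x \mid y; \theta_2^*)\big) \\
H_1 &: E_p\big(\log p_1(x \mid y; \theta_1^*)\big) > E_p\big(\log p_2(x \mid y; \theta_2^*)\big) \\
H_2 &: E_p\big(\log p_1(x \mid y; \theta_1^*)\big) < E_p\big(\log p_2(x \mid y; \theta_2^*)\big).
\end{aligned}$$

Hypothesis $H_k$ thus favors the $k$th competing model, with the
null hypothesis $H_0$ representing their equivalence.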
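
The empirical cross-entropy comparison described above can be illustrated with a minimal Python sketch. It assumes univariate class-conditional Gaussian models as stand-ins for $p_k(x \mid y; \theta_k^*)$; the function names, the toy samples, and the parameter values are all hypothetical and chosen only for illustration.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Normal; stands in for log p_k(x | y; theta_k*)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def empirical_cross_entropy(samples, class_params):
    """Average negative log-likelihood over labeled pairs (x_i, y_i).

    class_params maps each label y to the (mean, std) of a hypothetical
    class-conditional Gaussian model.
    """
    total = 0.0
    for x, y in samples:
        mean, std = class_params[y]
        total += gaussian_logpdf(x, mean, std)
    return -total / len(samples)

# Toy labeled data (hypothetical values).
samples = [(0.1, "a"), (-0.2, "a"), (2.1, "b"), (1.8, "b")]
model_1 = {"a": (0.0, 1.0), "b": (2.0, 1.0)}  # class means near the data
model_2 = {"a": (1.0, 1.0), "b": (1.0, 1.0)}  # ignores the label entirely

ce_1 = empirical_cross_entropy(samples, model_1)
ce_2 = empirical_cross_entropy(samples, model_2)
best = 1 if ce_1 < ce_2 else 2  # smaller cross-entropy is preferred
```

Selecting the smaller empirical cross-entropy is equivalent to selecting the larger average log-likelihood, which is the quantity the hypotheses above compare in expectation.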

The natural test statistic in this labeled data setting is then
given by the log-ratio of likelihoods $p_1, p_2$ described above,
evaluated with respect to training data (possibly even the
same training data used to fit the maximum-likelihood model
parameter estimates $\theta_1^*, \theta_2^*$) as follows:

$$T_{\mathrm{lab}} := \sum_{i=1}^{n} \log \frac{p_1(x_i \mid y_i; \theta_1^*)}{p_2(x_i \mid y_i; \theta_2^*)}. \tag{1}$$



The careful reader will note that in such a regime, where
expectations are defined with respect to some unknown distribution
$p$, we are in fact working with potentially misspecified
models $p_1$ and $p_2$; see [7], [8] for properties of maximum-likelihood
estimation of the parameter sets $\theta_1$ and $\theta_2$ in this
setting. For our purposes it suffices to note that such estimators
still possess the requisite technical properties.

In the case of interest to us here, the conditional models
$p_1$ and $p_2$ are assumed to be strictly non-nested, such that no
conditional distribution in $X$ given $Y$ can be achieved by both
$p_1$ and $p_2$. Vuong [6] shows a central limit theorem for this
setting when $H_0$ is in force, in that as the number of training
samples grows large, an appropriately standardized version of
the test statistic $T_{\mathrm{lab}}$ is asymptotically distributed as a unit
Normal. (It is straightforward to proceed in the absence of
this assumption, with appropriate adjustments to test statistic
asymptotics.) The necessary normalization is given by the
sample standard deviation of log-likelihood ratio evaluations
times the root of the number of training samples; if $H_0$ fails
to be in force, then the value of this statistic diverges (almost
surely) to $\pm\infty$.

This result in turn implies a concrete directional test
for model selection: fixing a significance level $\alpha$ yields a
corresponding critical value $z_{\alpha/2}$ according to the standard
Normal distribution. If the normalized test statistic evaluates
to greater than $z_{\alpha/2}$, we select model $p_1$; if it evaluates to
less than $-z_{\alpha/2}$, we decide in favor of model $p_2$. Otherwise,
there is insufficient evidence to reject the
hypothesis $H_0$ of model equivalence, and we conclude that
models $p_1$ and $p_2$ cannot be distinguished on the basis of the
given training data and chosen significance level.
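
A minimal Python sketch of this directional test follows, assuming that the per-sample log-likelihood ratios $\log p_1(x_i \mid y_i; \theta_1^*) - \log p_2(x_i \mid y_i; \theta_2^*)$ have already been computed. The function name `directional_lr_test` and the example values are hypothetical; the normalization follows the verbal description above (sample standard deviation times the root of the sample size).

```python
import math
from statistics import NormalDist

def directional_lr_test(log_ratios, alpha=0.05):
    """Return 1 or 2 for the selected model, or 0 if H0 is not rejected.

    log_ratios[i] is the log-likelihood ratio of sample i (model 1 over model 2).
    Assumes at least two samples with nonzero sample variance.
    """
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    # Sample variance of the per-sample log-likelihood ratios.
    var = sum((r - mean) ** 2 for r in log_ratios) / (n - 1)
    # T_lab / (sqrt(n) * std) is approximately standard Normal under H0.
    z = sum(log_ratios) / (math.sqrt(n) * math.sqrt(var))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    if z > z_crit:
        return 1
    if z < -z_crit:
        return 2
    return 0

# Hypothetical log-likelihood ratios that consistently favor model 1.
ratios = [0.5, 0.7, 0.4, 0.6, 0.55, 0.45, 0.65, 0.5]
decision = directional_lr_test(ratios)
```

Returning 0 corresponds to the case in which the two models cannot be distinguished at the chosen significance level.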

B. The Semi-Supervised Case: Unlabeled Development Data 

Now suppose that our two competing models $p_1$ and $p_2$
have already been "trained," such that $\theta_1, \theta_2$ have been fitted
by maximum-likelihood estimation to obtain $\theta_1^*, \theta_2^*$, but that
we wish to leverage $n$ additional unlabeled data examples
$x_1, x_2, \ldots, x_n$ to accomplish the model selection task described
in Section II-A above. Lacking the corresponding
class labels $y_1, y_2, \ldots, y_n$ for these data, we thus seek to
employ automatically generated labels $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ fitted
respectively by maximum-likelihood under each of the two
systems, such that we replace the conditional log-likelihood
ratio of (1) by the generalized log-likelihood ratio

$$\sum_{i=1}^{n} \log \frac{p_1(x_i \mid \hat{y}_i; \theta_1^*)}{p_2(x_i \mid \hat{y}_i; \theta_2^*)}
  = \sum_{i=1}^{n} \log \frac{\max_y p_1(x_i \mid y; \theta_1^*)}{\max_y p_2(x_i \mid y; \theta_2^*)}. \tag{2}$$
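
The following Python sketch illustrates the generalized log-likelihood ratio under stated assumptions: each model is a small set of hypothetical class-conditional Gaussians, and "decoding" simply picks the label that maximizes the likelihood of each unlabeled example under that model. The names `decode` and `generalized_log_ratio`, and all numerical values, are illustrative rather than taken from the systems considered in this article.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Normal (toy stand-in for an acoustic model)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def decode(x, class_params):
    """Maximum-likelihood label for x under one model, with its log-likelihood."""
    label = max(class_params, key=lambda y: gaussian_logpdf(x, *class_params[y]))
    return label, gaussian_logpdf(x, *class_params[label])

def generalized_log_ratio(xs, params_1, params_2):
    """Sum over i of log [max_y p1(x_i | y) / max_y p2(x_i | y)]."""
    total = 0.0
    for x in xs:
        _, loglik_1 = decode(x, params_1)  # label chosen by model 1
        _, loglik_2 = decode(x, params_2)  # label chosen by model 2
        total += loglik_1 - loglik_2
    return total

# Unlabeled toy examples and two hypothetical trained models.
xs = [0.1, -0.2, 2.1, 1.8]
model_1 = {"a": (0.0, 1.0), "b": (2.0, 1.0)}
model_2 = {"a": (1.0, 1.0), "b": (1.0, 1.0)}
glr = generalized_log_ratio(xs, model_1, model_2)
```

Note that each model labels the data independently, so the two systems may disagree on the label assigned to a given example.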



Of course, maximum-likelihood labeling ("decoding") of $Y$
given $X$ incurs some error, and hence it is natural to ask
under what conditions we can replace $T_{\mathrm{lab}}$ in the labeled-data
model selection task of Section II-A with (2). Since this
corresponds to the use of labels taken as output from trained
systems (i.e., estimated under each of the two competing
models $p_1$ and $p_2$), this procedure will inevitably suffer from
misclassification errors with respect to the estimated labels; if
systems $p_1$ and $p_2$ exhibit reasonable performance, however,
the corresponding marginal error rate $\epsilon$ will be small. In the
limit as $\epsilon$ tends to zero, of course, we recover precisely the
setting of labeled data encountered in Section II-A above.

For the case of small but nonzero $\epsilon$, and assuming now that
the true data-generating model is either $p_1$ or $p_2$, we show below
that a principled model selection procedure may be obtained
by adapting results from the labeled-data setting as follows.
Each individual likelihood ratio $p_1(x_i \mid \hat{y}_i; \theta_1^*)/p_2(x_i \mid \hat{y}_i; \theta_2^*)$
will instead be censored, by bounding its range from above
and below in order to limit the influence of misclassification
errors on the overall model selection procedure. In the limit, as
we will see, this recovers the well-known nonparametric sign
test, which simply tabulates for every $i = 1, 2, \ldots, n$ the sign
of each log-likelihood ratio, rather than its actual value. As
we formulate in Section II-C below, this approach sacrifices a
degree of statistical efficiency for enhanced robustness, which
in turn enables the influence of errors in the set $\{\hat{y}_i\}$ of
automatically generated labels to be limited.

Not only is this approach intuitively reasonable, but it is also
provably optimal in a minimax sense, as we now describe. To
account for the misclassification errors induced by automatic
labeling, we model the consequence of this inexact labeling
procedure by replacing the exact conditional densities $p_1(x \mid y)$
and $p_2(x \mid y)$ with mixtures of these densities and "contaminating"
distributions that represent the aggregate effects of
misclassification. The misclassification error rate $\epsilon < 1$
moreover serves as the mixture weight for each respective
contaminating density: the so-called $\epsilon$-contaminated case [2].

Rather than seeking to determine these contaminating distributions
directly, it is natural to ask if there exists a least
favorable case: a form of contamination that, for fixed $\epsilon$,
would serve to maximize the probability of selecting the
incorrect model $p_1$ or $p_2$. The answer is affirmative: Amongst
all possible contaminating densities, we are guaranteed that
a least favorable pair exists whenever the likelihood ratio
$p_1(x \mid y)/p_2(x \mid y)$ is monotone and $\epsilon$ is small enough to ensure
that the corresponding sets of admissible $\epsilon$-contaminated
mixtures remain disjoint.

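In this case, a result obtained by Huber [2, Theorem 3.2] in
the context of robust statistics may be applied to show that,
to minimize this maximal risk of an error in model selection,
it suffices to consider a specific form of contamination of $p_1$
by $p_2$, and vice-versa. The precise mixture form required by
Huber's result is obtained by partitioning the range space of
$p_1$ and $p_2$ in a manner that depends on $b > a > 0$ as follows:

$$\tilde{p}_1(x \mid \cdot) = (1 - \epsilon) \cdot
\begin{cases}
p_1(x \mid \cdot) & \text{whenever } p_1 > a\,p_2, \\
a\,p_2(x \mid \cdot) & \text{otherwise;}
\end{cases}$$

$$\tilde{p}_2(x \mid \cdot) = (1 - \epsilon) \cdot
\begin{cases}
p_2(x \mid \cdot) & \text{whenever } p_2 > \tfrac{1}{b}\,p_1, \\
\tfrac{1}{b}\,p_1(x \mid \cdot) & \text{otherwise.}
\end{cases}$$

A likelihood ratio test based on $\tilde{p}_1/\tilde{p}_2$ is thus seen to yield

$$\frac{\tilde{p}_1(x \mid \cdot)}{\tilde{p}_2(x \mid \cdot)} =
\begin{cases}
a & \text{if } p_1/p_2 < a, \\
p_1(x \mid \cdot)/p_2(x \mid \cdot) & \text{if } a \le p_1/p_2 \le b, \\
b & \text{if } p_1/p_2 > b,
\end{cases}$$

and hence we have arrived at the minimax test for the case of
$\epsilon$-contaminated densities $p_1$ and $p_2$: a test based on likelihood
ratio evaluations censored from below at $a$ and above at $b$.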
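
As a hedged illustration of censoring, the Python sketch below clips each log-likelihood ratio to the interval $[\log a, \log b]$ before summation. The function name `censored_log_ratio_sum`, the censoring limits, and the example ratios are hypothetical; in particular, no claim is made here about how $a$ and $b$ should be chosen for a given contamination level $\epsilon$.

```python
import math

def censored_log_ratio_sum(log_ratios, a, b):
    """Sum of per-sample log-likelihood ratios, each clipped to [log a, log b]."""
    if not 0 < a < b:
        raise ValueError("censoring limits must satisfy 0 < a < b")
    lo, hi = math.log(a), math.log(b)
    return sum(min(max(r, lo), hi) for r in log_ratios)

# Hypothetical ratios: one grossly negative value, e.g. from a mislabeled example.
ratios = [0.2, -5.0, 3.0]
raw = sum(ratios)                                         # about -1.8
censored = censored_log_ratio_sum(ratios, a=0.5, b=2.0)   # about 0.2
```

In this toy case the uncensored sum is dominated by a single extreme value, whereas the censored sum bounds the contribution of any one example to at most $\log b$ in magnitude (here $|\log a| = \log b$).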

As noted by Huber, the limiting case occurs when $\epsilon$ is sufficiently
large that the sets of $\epsilon$-contaminated mixture densities
$\tilde{p}_1, \tilde{p}_2$ cease to be disjoint, and begin to overlap; in our setting,
this corresponds to the limit as $a$ and $b$ both approach unity. In
this limit, the log-likelihood ratio reflects
only which term of the comparison is larger, yielding the sign
test for model selection as described above:

$$T_{\mathrm{unlab}} := \#\left\{ i : \log \frac{p_1(x_i \mid \hat{y}_i; \theta_1^*)}{p_2(x_i \mid \hat{y}_i; \theta_2^*)} > 0 \right\}. \tag{3}$$

This test statistic is distributed as a sum of $n$ Bernoulli trials
whenever the unlabeled examples $x_1, x_2, \ldots, x_n$ are independent
and identically distributed, and is hence a binomial
random variable. As such, we obtain a concrete directional
test for model selection in the semi-supervised setting, in a
manner that generalizes the supervised setting of Section II-A
above.

As in the supervised case, we may fix a significance level
$\alpha$ and determine a corresponding critical value $k_\alpha$ according
to the binomial distribution with parameters $n$ and $p$, where
$p = 1/2$ under the null hypothesis of model equivalence.
For a one-sided upper-tail test of size $\alpha$, we reject $H_0$ in
favor of $H_1$ if $T_{\mathrm{unlab}} > k_\alpha$, where $k_\alpha$ is the smallest integer
such that $\sum_{k=k_\alpha+1}^{n} \binom{n}{k} (1/2)^n \le \alpha$; reversing this inequality
and summing from zero to $k_\alpha$ yields the corresponding one-sided
lower-tail test. For a fixed alternative with $p \neq 1/2$,
the corresponding probability of correct selection is given by
$\sum_{k=k_\alpha+1}^{n} \binom{n}{k} p^k (1-p)^{n-k}$. The sign test has many appealing
properties; we next investigate its statistical efficiency in this
context, and refer the reader to [9] for other results.
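
The one-sided upper-tail sign test can be sketched in Python as follows. This is a minimal illustration assuming that the per-example log-likelihood ratios are given; the function names are hypothetical, and ties (ratios equal to zero) are simply not counted as positive, which is one of several possible conventions.

```python
from math import comb

def upper_tail(n, k, p=0.5):
    """P(T > k) for T distributed as Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1, n + 1))

def critical_value(n, alpha):
    """Smallest integer k such that P(T > k) <= alpha when p = 1/2."""
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    return n  # not reached for alpha >= 0, since P(T > n) = 0

def sign_test(log_ratios, alpha=0.05):
    """Return True if H0 is rejected in favor of H1 (model 1 preferred)."""
    n = len(log_ratios)
    t_unlab = sum(1 for r in log_ratios if r > 0)  # count of positive signs
    return t_unlab > critical_value(n, alpha)

k_alpha = critical_value(10, 0.05)
# Probability of correct selection for a hypothetical alternative p = 0.9.
power = upper_tail(10, k_alpha, p=0.9)
```

Because the binomial distribution is discrete, the attained size of the test is generally somewhat smaller than the nominal level $\alpha$.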



C. Analysis: Comparing Statistical Efficacy and Efficiency 

To summarize the results of [2] and [6] as they apply to our
discussion of model selection above, the best test in the case of
labeled development data accumulates the log-likelihood ratios
of each example $x_i$ given its correct label $y_i$, while in the case
of unlabeled development data the corresponding minimax
test accumulates the signs of these ratios when evaluated with
respect to each automatically generated label $\hat{y}_i$. To compare
the statistical efficacy of these two testing procedures, we
may compute their asymptotic relative efficiency under general
assumptions regarding the limiting distributions of (suitably
standardized versions of) test statistics $T_{\mathrm{lab}}$ of (1) and $T_{\mathrm{unlab}}$
of (3) obtained under the null hypothesis.

Asymptotic relative efficiency expresses the limiting ratio 
of sample sizes necessary for two respective tests to achieve 
the same power and level against a common alternative; if one 
test has an asymptotic efficiency of 50% relative to another, 
then the former requires twice as many samples (in the large- 
sample limit) to achieve the same performance. Its computa- 
tion requires knowledge of the asymptotic distributions of both 
test statistics under the null hypothesis, as we now describe. 

Recall that when comparing strictly non-nested models using
labeled data, a limit theorem holds under the null; let $f(\cdot)$
denote the associated density function, with corresponding
variance $\sigma^2$. The so-called efficacy of the labeled-data test is
in turn given by $1/\sigma$ under suitable regularity conditions, with
that of the unlabeled-data sign test given by $2f(0)$ when $T_{\mathrm{unlab}}$
is appropriately standardized [9].

The corresponding asymptotic relative efficiency is in turn
given by the squared ratio of test efficacies, which evaluates
to the quantity $[2\sigma f(0)]^2$. This result implies that when $T_{\mathrm{lab}}$
is asymptotically Normal, the sign test corresponding to (3)
is only $2/\pi \approx 64\%$ as efficient as the labeled-data test
corresponding to (1), since $(2\sigma/\sqrt{2\pi\sigma^2})^2 = 2/\pi$. We may
in fact generalize this result slightly by following the analysis
of [10], and considering the so-called generalized Gaussian
distribution with location parameter $\mu$ and scale parameter $\sigma$:

$$f_p(x) = \frac{1}{2\sigma\,C(p)^{1/p}\,\Gamma(1 + 1/p)}
  \exp\left\{ -\frac{1}{C(p)} \left| \frac{x - \mu}{\sigma} \right|^p \right\}.$$

Here $\Gamma(\cdot)$ is the Gamma function, $C(p) = [\Gamma(1/p)/\Gamma(3/p)]^{p/2}$,
and exponent $1 \le p \le 2$ allows us to interpolate between the
Laplacian ($p = 1$) and Normal ($p = 2$) densities.

Fig. 1. Asymptotic relative efficiency of tests in the semi-supervised
versus supervised settings, when the test statistic of the latter converges to a
generalized Gaussian distribution with exponent $p$ between 1 (Laplacian) and
2 (Normal). The horizontal line divides the range of $p$ into cases for which
the sign test is less efficient than the conventional likelihood ratio test, as in
the case of the Normal, and vice-versa. (Horizontal axis: generalized Gaussian exponent.)

If we thus consider the expression $[2\sigma f_p(0)]^2$ for asymptotic
relative efficiency, it follows from the relation $\Gamma(1 + 1/p) =
\Gamma(1/p)/p$ that, as a function of exponent $p \in [1, 2]$, the asymptotic
relative efficiency for the case of a generalized Gaussian
distribution having exponent $p$ is $p^2\,\Gamma(3/p)/[\Gamma(1/p)]^3$. This
result is illustrated in Figure 1, which confirms that, were the
asymptotic distribution of $T_{\mathrm{lab}}$ to approach a Laplacian density
with $p = 1$, rather than a Normal with $p = 2$, the sign test
would be twice as efficient in the large-sample limit.
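
The closed-form expression above is straightforward to evaluate numerically. The short Python sketch below (with a hypothetical function name) computes the asymptotic relative efficiency for a given exponent and checks the two endpoint cases discussed in the text.

```python
from math import gamma, pi

def sign_test_are(p):
    """Asymptotic relative efficiency p^2 * Gamma(3/p) / Gamma(1/p)^3 for exponent p."""
    return p ** 2 * gamma(3 / p) / gamma(1 / p) ** 3

are_normal = sign_test_are(2.0)     # Normal case, equal to 2/pi
are_laplacian = sign_test_are(1.0)  # Laplacian case, equal to 2

# Values on a coarse grid of exponents between 1 and 2.
grid = [(1 + 0.2 * i, sign_test_are(1 + 0.2 * i)) for i in range(6)]
```

Evaluating the expression on a grid of exponents in this way gives a numerical counterpart to the efficiency curve described in the figure caption.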

D. Selecting from Amongst k > 2 Competing Models 

As demonstrated above, the case of two competing hy- 
potheses yields theoretical performance guarantees; however, 
in practice it is often necessary to select from amongst k > 2 
models. While optimality is no longer necessarily retained [2],
this problem is of sufficient practical interest to have generated
a large contemporary literature in machine learning [11], [12].

Of the many approaches described in, e.g., [11], [12], several
feature pairwise comparisons: in the so-called "one vs. all"
method, each model is assigned a real-valued score relative
to all others, and the model with the highest overall score
is selected. Other possible approaches include "tournament-style"
selection, following initial pairwise comparisons, or the case of
all possible $\binom{k}{2}$ pairwise comparisons.

The latter approach has been suggested in [13] for the
case of the sign test, and currently remains common practice 
within the machine learning community, despite multi-class 
procedures tailored to specific learning methods [11]. As
such, we employ it to select amongst competing pronunciation 
models in our experiments below. 
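
A minimal Python sketch of selection via all pairwise sign comparisons is given below. The source does not specify how the pairwise outcomes are aggregated, so the rule used here (select the model winning the most pairwise comparisons, with no significance threshold) is an assumption made purely for illustration; the function name and the toy log-likelihood values are likewise hypothetical.

```python
from itertools import combinations

def pairwise_sign_selection(loglik):
    """Select among k models using all k-choose-2 pairwise sign comparisons.

    loglik[m][i] is the log-likelihood of example i under model m.
    Returns the index of the selected model and the per-model win counts.
    """
    k = len(loglik)
    wins = [0] * k
    for a, b in combinations(range(k), 2):
        # Count the examples on which each model of the pair attains the higher likelihood.
        a_better = sum(1 for la, lb in zip(loglik[a], loglik[b]) if la > lb)
        b_better = sum(1 for la, lb in zip(loglik[a], loglik[b]) if lb > la)
        if a_better > b_better:
            wins[a] += 1
        elif b_better > a_better:
            wins[b] += 1
    selected = max(range(k), key=lambda m: wins[m])
    return selected, wins

# Three hypothetical models evaluated on three examples.
loglik = [
    [-1.0, -2.0, -1.5],
    [-1.2, -2.5, -1.0],
    [-3.0, -3.0, -3.0],
]
selected, wins = pairwise_sign_selection(loglik)
```

In practice each pairwise comparison could instead be passed through the binomial critical value of the sign test, so that only statistically significant differences count as wins.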



III. Application: Selecting Pronunciation Models 

As a prototype application of the semi-supervised model
selection approach derived in Section [II], we now consider the
task of evaluating candidate pronunciations of spoken words
in large-scale speech processing tasks. To select amongst
competing pronunciations, we consider two speech recognition
systems that differ only in the pronunciation of a particular
word, and show how to employ both the conventional test
of (1), using transcribed audio data, and the sign test of (3),
using untranscribed audio data.

A. Motivation for Semi-Supervised Pronunciation Selection 

The selection of pronunciation models is crucial to several 
speech processing applications, including large-vocabulary 
continuous speech recognition, spoken term detection, and 
speech synthesis, each of which requires knowledge of the 
pronunciation(s) of each word of interest. In this setting, 
a set of admissible pronunciations forms what is termed a 
pronunciation lexicon, which comprises mappings from an 
orthographic form of a given word (e.g., tornados) to a 
phonetic form (e.g., /t er n ey d ow z/).

The conventional means of creating a pronunciation lexicon 
is to employ a trained linguist. However, as is the case 
with other examples requiring data to be hand-labeled by 
experts, this process is expensive, inconsistent, and even at 
times impossible, when individuals lack sufficiently broad 
expertise to create pronunciations for all words of interest 
[14]. In turn, several approaches for automatically generating
pronunciations have been put forward [14], [15], [17], [19],
[20], [21], and inevitably a model selection decision must be
made to choose between candidate pronunciations. However, 
these approaches have themselves relied upon labeled training 
data, in the form of spoken examples of a given word and the 
corresponding orthographic transcripts. 

In addition to the initial creation of a lexicon, pronunciation 
models are also necessary to maintain the vocabulary of speech 
processing systems over time: Although the pronunciation 
lexicon for a given system is created for as large a vocabulary 
as possible before deployment, this lexicon must be extended 
over time to incorporate out-of-vocabulary words. Such terms 
can be new words or names that come into common us- 
age, rare or foreign words, or simply words not deemed 
significantly important at the time a system's lexicon was 
constructed. Dynamically adjusting to changing vocabularies 
thus requires the generation of new pronunciations over time, 
thereby reinforcing the need for an efficient and effective 
means of automatically selecting from amongst candidate 
pronunciations [22], [23], [24]. 

B. Methods for Selecting a Pronunciation Model 

Much effort to date has been focused in the area of au- 
tomatic pronunciation modeling — i.e., grapheme-to-phoneme 
or letter-to-sound rules. Previous work, including [14] and 
JT3J , has attempted to simultaneously generate a set of pro- 
nunciations and select between them. Also, work including 
[16] augments the possible pronunciations by building a larger
Word      Candidate Pron.       Reference Pron.

guerilla  g ax r ax l ax        g ax r ih l ax
guerilla  g w eh r ih l ax
tornados  t er n ey d ow z      t er n ey d ow z
tornados  t ao r n ey d ow s    t ow r n ey d ow z

TABLE I

Examples of candidate and reference pronunciations


phone network to select the pronunciation. Additional re- 
sources are typically required, including existing pronunciation 
lexica [14], speech samples [19], [20], linguistic rules [21],
or a combination of these. The focus of previous work has 
been on pronunciation variation [14], [19] or on common
words 03], ifTTl . Note that in practice, other concerns may 
dictate choices between competing pronunciations, such as the 
scenario considered in [18], while highlighting the trade-offs 
between word accuracy and overall word error rate (WER). 
In the current setting, however, we are agnostic as to how 
the pronunciations are generated; our goal is simply to choose 
between them. 

To this end, consider the setting in which we have example
utterances $\{x_1, x_2, \ldots, x_n\}$, their corresponding transcripts
$\{y_i\}$, and two "trained" speech recognition systems $p_1(\cdot)$
and $p_2(\cdot)$ that are identical (i.e., conditioned on the same
parameters) except that for one word, models $p_1$ and $p_2$
use different pronunciations, say $\theta_1^*$ for $p_1(\cdot\,; \theta_1^*)$ and $\theta_2^*$
for $p_2(\cdot\,; \theta_2^*)$. This corresponds to the case of strictly non-nested
models outlined in Section [II]. We subsequently describe
and compare a supervised and semi-supervised method to
select between candidate pronunciations $\theta_1^*$ and $\theta_2^*$, and hence
between models $p_1$ and $p_2$, in settings where candidate words
are analyzed one at a time (as opposed to comparing entire
pronunciation lexicons).

1) Supervised Selection of Pronunciations: The conven- 
tional mechanism for choosing between reference pronunci- 
ations of a word, examples of which are shown in Table I, is 
to acquire spoken utterances that contain the word, along with 
an orthographic transcription of the utterances, and compute 
a forced alignment of the acoustic waveform data to the 
transcripts, first using one pronunciation and then using the 
other fT4), 05], (20), GQ. The pronunciation that is assigned 
a higher (Viterbi maximum likelihood) score during alignment 
is then chosen. For each word there are a fixed number of 
candidate pronunciations, with at least one (e.g., guerilla) 
reference pronunciation per word, although there may be 
several (e.g., tornados). 

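Cast in the notation of Section II, the conventional supervised
method of pronunciation selection proceeds as follows:

1) Use the sequence of words comprising reference transcription
$y_i^{(\mathrm{ref})}$ for utterance $X_i = x_i$ to compute the
log-likelihood ratio

$$\Lambda_i\bigl(x_i \mid \theta_1^*, \theta_2^*, y_i^{(\mathrm{ref})}\bigr) = \sum_{y \in y_i^{(\mathrm{ref})}} \Bigl[ \log p_1(x_i \mid y; \theta_1^*) - \log p_2(x_i \mid y; \theta_2^*) \Bigr];$$

2) Use the $n$ utterances to form $T_{\mathrm{lab}}$ and test as follows:

$$T_{\mathrm{lab}} = \sum_{i=1}^{n} \Lambda_i\bigl(x_i \mid \theta_1^*, \theta_2^*, y_i^{(\mathrm{ref})}\bigr) \underset{\mathcal{H}_2}{\overset{\mathcal{H}_1}{\gtrless}} \tau_{\mathrm{lab}}; \tag{4}$$

3) Decide between $\mathcal{H}_1$ (model/pronunciation $\theta_1^*$) and
$\mathcal{H}_2$ (model/pronunciation $\theta_2^*$) based on the difference
in conditional likelihood evaluations, given forced-alignment
reference transcripts, as indicated in (4).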
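
The supervised decision rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the system used in the experiments: it assumes the per-utterance forced-alignment log-likelihoods under each pronunciation have already been computed by a recognizer, and the function name `supervised_select`, its arguments, and the tie-breaking choice (ties go to the second candidate) are all hypothetical.

```python
from typing import Sequence


def supervised_select(loglik_1: Sequence[float],
                      loglik_2: Sequence[float],
                      tau_lab: float = 0.0) -> int:
    """Return 1 or 2, the index of the chosen pronunciation.

    loglik_1[i] and loglik_2[i] are assumed to be the forced-alignment
    log-likelihoods of utterance i under pronunciations 1 and 2.
    """
    if len(loglik_1) != len(loglik_2):
        raise ValueError("need one log-likelihood per utterance per model")
    # Accumulate the per-utterance log-likelihood ratios into T_lab.
    t_lab = sum(a - b for a, b in zip(loglik_1, loglik_2))
    # Compare against the threshold; ties fall to candidate 2 (an assumption).
    return 1 if t_lab > tau_lab else 2


# Hypothetical scores for two utterances containing the word of interest.
print(supervised_select([-100.0, -90.0], [-105.0, -89.0]))
```

With a threshold of zero, this reduces to choosing the candidate with the higher total log-likelihood.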

2) Semi-Supervised Pronunciation Selection: The conven- 
tional method of pronunciation selection described above 
requires transcribed audio data whose production is a difficult, 
time-consuming, and laborious task. In many applications, 
external information can potentially alleviate the need for 
transcriptions by identifying recorded speech segments that are 
a priori likely to contain instances of a given word, which in 
turn may be used to select between candidate pronunciations. 
Examples include news items and television shows, each of 
which provides a rich source of untranscribed speech that 
could serve to improve the selection of pronunciations. 

It is furthermore often the case that, while a transcript 
corresponding to spoken examples of a word is unavailable, 
we may have some knowledge that it has occurred in a 
particular audio archive. For example, we may know from 
weather records that a broadcast news episode recently aired 
about natural disasters, giving us a degree of confidence that 
instances of words like tornados are likely to appear. We may 
not know where or how many times such a word occurs in 
a particular audio segment, but we can still use the entire 
broadcast to help us choose between candidate pronunciations 
for tornados, examples of which are given in Table [I] 

In the absence of labeled examples, we propose to use the 
recognition system outputs themselves — unconstrained by any 
forced alignment or reference transcript — to select between 
candidate pronunciations. Each speech recognition system is 
run on every candidate data segment likely to contain a given 
word of interest, and from these results the corresponding 
acoustic likelihoods are evaluated with respect to the entire 
data set, leading to the selection of the candidate pronunciation 
yielding the highest overall likelihood. 

Recalling our notation for the competing models $p_1(\cdot;\theta_1^*)$
and $p_2(\cdot;\theta_2^*)$, with corresponding pronunciations
$\theta_1^*$ and $\theta_2^*$, this semi-supervised approach proceeds in
analogy to the labeled-data setting as follows:

1) Form the automatically generated word sequences $\hat{y}_i^{(1)}$
and $\hat{y}_i^{(2)}$ for each utterance $X_i = x_i$:

$$\hat{y}_i^{(1)} = \arg\max_{y} \, p_1(y \mid x_i; \theta_1^*),$$

$$\hat{y}_i^{(2)} = \arg\max_{y} \, p_2(y \mid x_i; \theta_2^*),$$

and use $\hat{y}_i^{(1)}$, $\hat{y}_i^{(2)}$ to compute the log-likelihood ratio

$$\Lambda_i(x_i \mid \theta_1^*, \theta_2^*) = \sum_{y \in \hat{y}_i^{(1)}} \log p_1(x_i \mid y; \theta_1^*) - \sum_{y \in \hat{y}_i^{(2)}} \log p_2(x_i \mid y; \theta_2^*);$$

2) Use the $n$ utterances to form $T_{\mathrm{unlab}}$ and test as follows:

$$T_{\mathrm{unlab}} = \#\bigl\{ i : \Lambda_i(x_i \mid \theta_1^*, \theta_2^*) > 0 \bigr\} \underset{\mathcal{H}_2}{\overset{\mathcal{H}_1}{\gtrless}} \tau_{\mathrm{unlab}}; \tag{5}$$

3) Decide between $\mathcal{H}_1$ (model/pronunciation $\theta_1^*$) and
$\mathcal{H}_2$ (model/pronunciation $\theta_2^*$) based on the number of
log-likelihood ratios that evaluate to be positive, as indicated
in (5).
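
As a rough sketch of the sign test in step 2, the following Python counts the positive log-likelihood ratios and compares the count to a threshold. The function name `sign_test_select` and the example values are hypothetical, and the default majority threshold (`n // 2 + 1`, with the count required to meet or exceed it) is an assumption of this sketch.

```python
from typing import Optional, Sequence


def sign_test_select(llrs: Sequence[float],
                     tau_unlab: Optional[int] = None) -> int:
    """Return 1 or 2 based on how many log-likelihood ratios are positive.

    llrs[i] is assumed to be the log-likelihood ratio for utterance i,
    computed from each system's own (unconstrained) recognition output.
    """
    n = len(llrs)
    if n == 0:
        raise ValueError("at least one utterance is required")
    # Censoring: only the sign of each ratio is kept, not its magnitude.
    t_unlab = sum(1 for value in llrs if value > 0)
    if tau_unlab is None:
        tau_unlab = n // 2 + 1  # strict majority (assumed default)
    return 1 if t_unlab >= tau_unlab else 2


# Hypothetical ratios: one large negative outlier among small positives.
llrs = [12.5, -300.0, 4.1, 0.7, 9.9]
print(sign_test_select(llrs))  # four of five ratios favor candidate 1
print(sum(llrs) > 0)           # the uncensored sum would favor candidate 2
```

The example is chosen to show why censoring matters: a single badly mislabeled utterance can dominate a sum of ratios, but it contributes only one vote to the sign test.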

IV. Large-Scale Experimental Validation 

We now present an experimental validation of the semi- 
supervised model selection approach presented in the preced- 
ing sections, consisting of selecting between candidate pronun- 
ciations in the context of three prototypical large-scale speech 
processing tasks. For each of 500 different words, forced 
alignment and recognition outputs were produced for every 
pair of pronunciation candidates. Recognition was performed 
on an hour of speech for every word and each corresponding 
candidate, making sure to include somewhere in the data to 
be recognized the same speech utterances that were used in 
the forced-alignment setting, yielding a total of 1000 hours of 
recognized speech. 

The quality of the selected pronunciations was then evalu- 
ated in three different ways: through decision-error trade-off 
curves for spoken term detection, phone error rates relative 
to a hand-crafted pronunciation lexicon, and word error rates 
for large-vocabulary continuous speech recognition. All ex- 
periments were conducted using well-known data sets, and 
state-of-the-art recognition, indexing, and retrieval systems. 

A. Methods and Data 

In order to evaluate the performance of semi-supervised 
pronunciation selection and its suitability for a variety of 
applications (e.g., recognition, retrieval, synthesis), and for a 
variety of word types (e.g., names, places, rare/foreign words), 
we selected speech from an English-language broadcast news 
corpus and identified 500 single words of interest. Common 
English words were removed from consideration, to ensure 
that words of interest would often be absent from lexicons, 
and thus would require pronunciation selection (e.g., Natalie, 
Putin, Holloway), and all words of interest featured in at least 5 
acoustic instances. The selected words of interest were verified 
to be absent from the recognition system's vocabulary, and all 
speech utterances containing these words were removed from 
consideration during the acoustic model training stage. 

For each word of interest, two candidate pronunciations
were considered, each of which was generated by one of
two different letter-to-sound systems [25]; furthermore, the
500 chosen words all had the property that the two
letter-to-sound systems produced different pronunciations for them. For
all subsequent experiments in semi-supervised pronunciation
model selection, the sign test threshold $\tau_{\mathrm{unlab}}$ was set at
$\tau_{\mathrm{unlab}} = n/2 + 1$, so that if more than half of the
log-likelihood ratios evaluated to be positive, then the corresponding
pronunciation model was chosen (i.e., a "winner-takes-all"
approach). The threshold reflects our a priori belief of equally
likely candidates, while enforcing our practical goal that one
candidate or the other must be selected. The sensitivity to
the threshold depends on the "distance" between models, as
well as the number of observations. For the experiments in
supervised pronunciation model selection, the threshold
$\tau_{\mathrm{lab}}$ was set at zero, so that the candidate with the
higher log-likelihood was chosen.

Word        No. Samples  |T_lab|    |T_unlab|
Acela       8            151.92     4
afterwards  38           4846.52    31
Albright    247          34118.11   230
Barone      16           3011.04    12
Beatty      5            359.75     5
Iverson     21           1698.90    18
Peltier     12           741.12     9
Villanova   6            902.04     3

TABLE II

Example words and their accumulated test statistics

To accomplish these experiments, a large-vocabulary continuous
speech recognition (LVCSR) system was built using the
IBM Speech Recognition Toolkit [26] with acoustic models
trained on 300 hours of HUB4 data. Around 100 hours
were used as the test set for recognition word error rate
and spoken term detection experiments. The language model
for the LVCSR system was trained on 400M words from
various text sources. The LVCSR system's word error rate on
a standard broadcast news test set RT04 (i.e., distinct from the
100 hours used for the test set employed below) was 19.4%.
This LVCSR system was also used for lattice generation in
the spoken term detection task. The OpenFST-based Spoken
Term Detection system described in [27] was used to index the
lattices and search for the 500 words of interest. For additional
details regarding the experimental procedures and data sets, the
reader is referred to [28].



B. Experimental Procedure 

To summarize the experimental procedure, two alternative 
pronunciations are generated by two different letter-to-sound 
systems for each of a set of 500 selected words. We also 
have a reference pronunciation for these words from a hand- 
crafted pronunciation lexicon. We assume for the purposes 
of these experiments that the reference pronunciation is not 
available, and we set ourselves the task of choosing between 
two alternative pronunciations for each word, evaluated with 
respect to three different metrics, as will be discussed below. 

The choice between the two pronunciations is made via
either the supervised method of Section III-B1 (denoted sup)
or the semi-supervised method of Section III-B2 (denoted
semi-sup):

• Sup selects the candidate pronunciation based on supervised
forced alignment with a reference transcript;

• Semi-sup selects the candidate pronunciation based on
unconstrained (i.e., fully automatic) recognition.

Some example words of interest and their accumulated test
statistics are shown in Table II. For each word, the number of
true speech samples is listed, along with the accumulated
log-likelihood ratios in accordance with (4), and the corresponding
number of accumulated sign-test samples as per (5), in which
the effect of likelihood censoring is apparent.

[Figure 2 plot: horizontal axis, false alarm probability (in %); curves labeled anti-oracle, semi-sup, and oracle.]

Fig. 2. Decision-error trade-off curves for a spoken term detection task
[28], generated from 100 hours of speech data, using chosen pronunciations
as queries to a phonetic/word-fragment index. Note that semi-sup and oracle
overlap at nearly all operating points.

Additionally, we compare the methods described above with 
an oracle and an anti-oracle, defined with respect to the hand- 
crafted lexicon as follows: 

• The oracle selects the candidate that has the smallest edit
distance to a reference pronunciation of that word;

• The anti-oracle selects the candidate that has the largest
edit distance to a reference pronunciation of that word.

To illustrate this notion, recall the earlier examples featured
in Table I, which lists two words, each with two hypothesized
pronunciations. In the case of these examples, the oracle
pronunciation selection method would select the entries
/g ax r ax l ax/ and /t er n ey d ow z/.
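
The oracle and anti-oracle rules can be illustrated with a short Python sketch. It assumes a standard Levenshtein distance over phone symbols with unit costs, and that a candidate's distance is taken to its closest reference; the function names and the tie-breaking behavior (the first listed candidate wins a tie) are hypothetical choices made for this sketch.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def oracle_pick(candidates: List[str], references: List[str],
                anti: bool = False) -> str:
    """Pick the candidate closest to (or, if anti, farthest from) a reference."""
    def dist(cand: str) -> int:
        return min(edit_distance(cand.split(), ref.split())
                   for ref in references)
    return max(candidates, key=dist) if anti else min(candidates, key=dist)


# Candidates and reference for "guerilla", taken from Table I.
cands = ["g ax r ax l ax", "g w eh r ih l ax"]
refs = ["g ax r ih l ax"]
print(oracle_pick(cands, refs))             # g ax r ax l ax
print(oracle_pick(cands, refs, anti=True))  # g w eh r ih l ax
```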



C. Results 

1) Spoken Term Detection: Experimental results comparing
competing approaches to selecting between candidate
pronunciations for purposes of spoken term detection
are shown in Fig. 2. Lattices generated by the
LVCSR system for the 100-hour test set were indexed and
used for spoken term detection experiments in the OpenFST-based
architecture described in [27]; the chosen pronunciations
were used as queries to the spoken term detection system.
Results from the OpenFST-based indexing system were computed
using standard formulas from the National Institute of
Standards and Technology (NIST) and scoring functions/tools
from the NIST 2006 spoken term detection evaluation. Note
that the decision-error trade-off curves demonstrate that
semi-sup performs better than the supervised method for detection
at nearly all operating points.

2) Phone Error Rate (PER): This experiment measures
which method (supervised or semi-supervised) selects
pronunciations that have smaller edit distance to a reference
pronunciation. Referring again to Table I as an example, if
the bolded pronunciations had been selected based on the
observed speech data, there would be 2 errors out of 6 phones
with respect to the closest reference pronunciation for guerilla
(delete /w/ and change /eh/ to /ax/), resulting in a 33% PER; for
tornados, the PER would be 0%.
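
The arithmetic in this example can be checked with a small Python sketch. It assumes the per-word PER is the unit-cost edit distance to the closest reference pronunciation divided by the number of phones in that reference; the function names are hypothetical, and how per-word errors are pooled into the corpus-level PER figures is not specified here.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def phone_error_rate(selected: str, references: List[str]) -> float:
    """Errors against the closest reference, divided by its length."""
    sel = selected.split()
    closest = min(references, key=lambda r: edit_distance(sel, r.split()))
    ref = closest.split()
    return edit_distance(sel, ref) / len(ref)


# guerilla: delete /w/ and substitute /eh/ -> /ax/, i.e. 2 errors in 6 phones.
print(round(100 * phone_error_rate("g w eh r ih l ax", ["g ax r ih l ax"])))
# tornados: the selected candidate matches a reference exactly.
print(phone_error_rate("t er n ey d ow z",
                       ["t er n ey d ow z", "t ow r n ey d ow z"]))
```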

We note that while the supervised method requires a 
few acoustic samples of a word of interest, the semi- 
supervised method requires that a few instances of the word be 
recognized — correctly or incorrectly — by the LVCSR system. 
If insufficiently many instances are recognized, then a choice 
between alternative pronunciations cannot be made. Therefore, 
depending on the accuracy of the system, only a subset of 
the 500 words may be resolved (in the sense of having 
a pronunciation selected) by the semi-supervised method. 
Consequently, we employed three different levels of language 
model pruning to yield three levels of system quality, defined 
in terms of word error rate on the standard RT04 data set. The 
resultant error rates on the RT04 data set were 29.3%, 24.5%, 
and 19.4%. 

We report the corresponding phone error rates in Table III,
from which we observe that additional words are indeed
resolved as system accuracy increases. By way of comparison,
at the 19.4% WER system setting, the oracle method had a
PER of 11.51%, and the anti-oracle had a PER of 27.2%.

It may also be observed from Table III that, for those words
which are resolved, the semi-supervised method (semi-sup)
chooses candidates with smaller edit distance to reference
pronunciations from a hand-crafted lexicon than does the
supervised method (sup).

3) Large-Vocabulary Continuous Speech Recognition: As a
final experiment, all four methods described in Section IV-B
for selecting between candidate pronunciations were used to
recognize 100 hours of speech that contained all 500 words
of interest. Table IV shows a comparison of the results in
terms of standard word error rates. Note that between the two
alternative pronunciations, the one with the smaller phoneme
edit distance to a reference pronunciation may not necessarily
be the one that results in a lower word error rate. Overall,
however, a range of about one-half of a percent of WER is
observed between the best and worst candidates considered;
note from Table IV that the supervised selection of
pronunciations based on a forced alignment yields a slightly lower
error rate in this instance than phoneme edit distance.

Finally, note that the semi-supervised method does as well
as the supervised method. As shown in Table III, of the 449
words that were resolved, both the supervised method and the
semi-supervised method selected the same candidate for 392
of them. Details of the remaining 57 words are presented in
Table VI. Candidate pronunciations are listed in the second
and third columns, with the better-performing candidate in
bold, and columns 4 and 5 detail the differing errors due to
selecting the candidate pronunciation not in bold in terms of
substitution errors, and insertion/deletion errors. Many of the
words where the methods chose different pronunciations do
not impact word error rate (and hence neither is in bold),
as the two candidate pronunciations are similar enough that
neither results in a lower WER.



D. Selecting from Amongst k > 2 Competing Pronunciations 

Method    System Quality (RT04 WER%)  No. Words Resolved  PER%
sup       29.3                        359                 13.00
semi-sup  29.3                        359                 12.64
sup       24.5                        390                 13.66
semi-sup  24.5                        390                 13.19
sup       19.4                        449                 14.50
semi-sup  19.4                        449                 13.87

TABLE III

Phone error rates (PER) with respect to a hand-crafted lexicon

Method       ASR WER%  No. Errors
anti-oracle  17.8      193,145
sup          17.3      187,772
semi-sup     17.3      187,424
oracle       17.4      188,517

TABLE IV

Automatic speech recognition (ASR) word error rates (WER)

Method          ASR WER%  No. Errors
mw-anti-oracle  17.8      193,145
mw-sup          17.0      184,345
mw-semi-sup     17.0      184,297
mw-oracle       17.0      184,373

TABLE V

Multi-way (MW) Pronunciation Selection (3 Pronunciations)

In practice it may well be necessary to compare more
than two pronunciations for a given word. For example,
morphologically rich languages may dictate the consideration
of $k > 2$ alternative pronunciations for a given orthographic
form. To demonstrate that our techniques remain appropriate
in this setting, we adopt here a strategy in which $\binom{k}{2}$ pairwise
comparisons are performed for the case $k = 3$. In this
approach, every unordered pair of candidate pronunciations
is evaluated using the criteria described above for the
anti-oracle, sup, semi-sup, and oracle methods. After all pairwise
comparisons have been completed, the candidate chosen the
greatest number of times is selected; as noted in Section II-D,
a variety of alternative approaches are also possible.
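
The pairwise voting strategy can be sketched as follows in Python. The function `multiway_select` and the toy `pairwise_winner` callback are hypothetical: in practice the callback would wrap one of the two-candidate selection rules, and the tie-breaking used here (the earliest listed candidate wins) is an assumption of this sketch.

```python
from itertools import combinations
from typing import Callable, Dict, List


def multiway_select(candidates: List[str],
                    pairwise_winner: Callable[[str, str], str]) -> str:
    """Run all k-choose-2 pairwise comparisons and return the top vote-getter."""
    wins: Dict[str, int] = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[pairwise_winner(a, b)] += 1
    # max() returns the first maximal element, so ties go to the earliest
    # candidate in the list (an arbitrary choice made for this sketch).
    return max(candidates, key=lambda c: wins[c])


# Toy stand-in for a two-way test: higher hypothetical score wins.
scores = {"pron_a": 0.2, "pron_b": 0.9, "pron_c": 0.5}


def toy_winner(a: str, b: str) -> str:
    return a if scores[a] >= scores[b] else b


print(multiway_select(["pron_a", "pron_b", "pron_c"], toy_winner))
```

For k = 3 this performs three comparisons; the number of comparisons grows quadratically with k.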

For the results that follow, for each of the 449 words
of interest, an additional third candidate pronunciation was
considered, taken (as the last entry for a given word) from
the reference pronunciation lexicon. Word error rate results
for this three-way comparison are shown in Table V. The
anti-oracle method WER remains the same as in the two-way
case (Table IV), as every additional candidate had 0% PER,
and by definition such candidates were not included in the
anti-oracle set. In a similar fashion, the oracle set contained
entirely reference pronunciations.

Relative to the earlier two-way comparison reported in
Table IV, the sup and semi-sup sets here contained 288 and
301 new pronunciations, respectively. The remaining results
summarized in Table V validate the trends observed in the
two-way comparison, namely that semi-sup and sup perform
comparably to each other, as well as to the oracle. Also,
as expected, combining a third pronunciation of high quality
resulted in lower error rates for all methods it affected.

V. Discussion 

In showing how censored likelihood ratios may be applied 
in the context of large-scale speech processing, we have de- 
veloped in this article a semi-supervised method for selecting 
pronunciations using unlabeled data, and demonstrated that it 
performs comparably to the conventional supervised method. 
Empirical evidence in support of this conclusion was exhibited 
across three distinct speech processing tasks that depend upon 
pronunciation model selection: decision-error trade-off curves 
for spoken term detection, phone error rates with respect to a 
hand-crafted reference lexicon, and word error rates in speech 
recognition. We have observed these results to be consistent
across many words of interest, based on extensive experiments
using state-of-the-art systems and well-known data sets.

Note that there are limitations to this method, however, 
in the context of pronunciation selection. First, if neither 
candidate is ever recognized, the "unconstrained" recognition 
step required in the semi-supervised setting can fail to choose 
a candidate pronunciation for a word. Also, the approach 
requires having seen textual examples of the word of interest or 
words like it. This seems a reasonable requirement, given that a 
word comes into fashion by being widely noticed. Finally, false
alarms in the recognition process may degrade performance
(for example, if a word of interest sounds like a common word),
but our experiments to vary system quality indicated that this
problem did not arise for the chosen words of interest in our
setting.

In summary, the conventional supervised method for 
system-level model selection optimizes empirical performance 
on a labeled development set. Instead, we focused in this 
article on leveraging unlabeled data to choose amongst trained 
systems through likelihood-ratio-based model selection. We 
showed how to generalize the conditional likelihood frame- 
work through the use of automatically generated labels as 
a proxy for labels generated by human experts. We then 
answered the question of how well the resultant censored 
likelihoods are likely to perform, from both a methodological 
and an applied perspective. 

As a final note, a current research direction of much interest 
to the speech community attempts to utilize untranscribed 
utterances for self-training of acoustic model parameters |4j, 
0. While our main interest here was in the general problem of 
non-nested model selection using unlabeled data, an appealing 
direction for future work is to take these ideas forward within 
the acoustic modeling context.

Term         semi-sup                  sup                       Differing Substitution Errors (No.)                       Ins/Del
Ahern        ey hh er n                ae er n                   ahern → upturn (3), apparent (2), hurry (1)               6
Aleve        ae l iy v                 ax l eh v                 (0)                                                       1
anybody's    eh n iy b aa d iy z       eh n iy b ah d iy z       (0)
Asean        ax s iy ih n              ey s iy ih n              asean → asham (1)                                         2
                                                                 and → asean (1)
Assuras      ax sh uh r ih s           ax sh uh r ax z           (0)
Avi          ax v iy                   ey v iy                   (0)
Beatty       b iy ae t iy              b ey t iy                 fabiani → beatty (1)                                      1
Bhuj         b uw jh                   b uw zh                   bhuj → pooch, boost, boots, chip, merge (1)               5
Canucks      k ae n ax k s             k ae n ah k s             canucks → connects (1)                                    2
                                                                 knox → canucks (1)
Cortese      k ao r t ey z iy          k ao r t eh z             cortese → he (2), tasty, daisy, taste (1)                 5
Cuellar      k w eh l er               k y uw l er               cuellar → korea, out (1)                                  2
Dundalk      d ah n d ao l k           d ah n d ao k             (0)
Dura         d uw r ax                 d uh r ax                 dura → dora (1)
Durango      d uh r ae ng g ow         d uh r ae ng ow           durango → tarango (1)                                     1
freemen's    f r iy m eh n z           f r iy m ih n z           (0)
Gejdenson    g ey hh d ax n s ax n     g ey hh d ih n s ax n     (0)
Gough        g ao f                    g ao                      gough → goff (2), damien (1)                              1
                                                                 schwarzkopf → gough (1)*
Grosjean     g r ow s jh ih n          g r ow jh iy n            grosjean → are, gross (1), on (1)*                        1
Hadera       hh ax d eh r ax           hh ae d eh r ax           hadera → era, out (1)                                     2
Heupel       hh oy p ax l              hh y uw p ax l            heupel → goals (1)                                        1
Ilan         ih l ax n                 ay l ax n                 ilan → airline (1)
ilo          ay l ow                   ih l ow                   ilo → iowa, eyal, low (1)
Iverson      ay v er s ax n            iy v er s ax n            iverson → iverson's (14), the (1)                         18
Jonbenet     jh aa n b ax n eh t       jh aa n b ax n eh         jonbenet → they (1)                                       1
Jurenovich   jh uw r eh n ax v ih ch   y uw r eh n ax v ih ch    jurenovich → renovate, renovation (3), average (2)        22
                                                                 jurenovich → events, pitch (2), want (1)
                                                                 jurenovich → against, batch, each, edge, irrelevant (1)
                                                                 jurenovich → edge, next, now, sh, tournaments (1)
Kmart        k ey m aa r t             k m aa r t                kmart → mart (9), answer (2), mark, out (1)               13
                                                                 has → kmart (1)
Lampe        l ae m p iy               l ae m p                  (0)
liasson      l y ae s ax n             ae s ax n                 liasson → hanson (1)                                      1
Likud's      l ih k ah d z             l ay k uw d z             (0)
Litke        l ih k iy                 l ih t k iy               litke → the (1)                                           1
Lukashenko   l uw k ae sh eh ng k ow   l uw k ax sh eh ng k ow   lukashenko → i (1)                                        1
Marceca      m aa r s ey k ax          m aa r s eh k ax          marceca → because, cut (1)                                1
                                                                 siegel → marceca (1)*
Matteucci    m ax t ey uw ch iy        m ae t uw ch iy           matteucci → see, to (1), matures (1)*                     1
Menendez     m eh n eh n d eh z        m eh n aa n d ey          menendez → as (3)                                         1
                                                                 as → menendez (3)*
Milos        m ay l ow z               m ih l ow z               (0)
Mustafa      m ah s t ax f ax          m uw s t aa f ax          mustafa → some, sun (1)                                   1
Nasrallah    n ae s r aa l ax          n aa r aa l ax            nasrallah → rolla, drama, on (1)                          3
Nhtsa        n ey t s ax               n t s ax                  nhtsa → a, nitze (1)                                      2
Nkosi        n k ow s iy               ng k ow z iy              nkosi → cozy (1)                                          1
Orelon       ao r l aa n               ao r ax l aa n            (0)
Ouattara's   w ax t ae r ax z          aw ax t ae r ax z         ouattara's → tara's (1)                                   1
Pawelski     p ao eh l s k iy          p ao l s k iy             pawelski → belsky, ski (1)                                2
Peltier      p eh l t iy er            p eh l t iy ey            peltier → tear (2), here, pepsi, years (1)                5
pre          p r ax                    p r                       pre → per (1)
Prodi        p r ax d iy               p r aa d iy               (0)
Sadako       s ax d aa k ow            s ae d ax k ow            sadako → got (1)                                          1
Schiavo      s k y ax v ow             sh ax v ow                schiavo → gavel, ski, elbow, oddball, on, out, will (1)   1
Schiavone    s k y ax v ow n           sh ax v aa n              schiavone → bony, bounty (2), a, money, it (1)            16
                                                                 schiavone → the, voting, about, donate, ioni, owning (1)
Schlossberg  sh l ao s b er g          sh l aa s b er g          (0)
Skurdal      s k er d ax l             s k er d aa l             scurbel → skurdal (1)
                                                                 skurdal → off (1)*
Taliban's    t ae l ih b ax n z        t ae l ih b ih n z        metallica → taliban's (1)                                 1
Thabo        th aa b ow                th ax b ow                thabo → and, tabor (2) m., problem (1)                    11
                                                                 thabo → hobbled, in, tomlin, trouble, tumbling (1)
tornados     t er n ey d ow z          t ao r n ey d ow s        (0)
Yasir        y ax s iy r               y aa s iy r               yasir → oster (1)                                         1
Yugoslavs    y uw g ow s l aa v z      y uw g ow s l aa v s      (0)
Zhirinovsky  zh ih r ih n ao v s k iy  iy r ih n ao v s k iy     zhirinovsky → ski, skin, speak (1)                        3
Zorich       z ax r ih ch              z ow r ih k               zorich → storage, h., is (2)                              6

TABLE VI

Words where the methods differ in selection. Differing errors listed caused by the non-bold pronunciation marked with an *.



